The following notebook explores the use of semantic embeddings to create taxonomies of entities, with the goal of creating an ontology of the ARIA dataset. It leverages embeddings of the tags identified in articles extracted from OpenAlex, and uses them to cluster entities into groups. The resulting taxonomy is assumed to be hierarchical.
Multiple options are explored to create the taxonomy, including:

- Strict nested clustering (KMeans and agglomerative), with fixed cluster counts per level.
- Strict nested clustering with imbalanced cluster counts, determined by parent cluster sizes.
- Fuzzy, level-agnostic clustering over a Cartesian product of parameter values.
- Dendrogram climbing from a single agglomerative clustering run.
- Nested KMeans runs over the centroids of the previous level.
The utils for this notebook include all necessary functions to create the taxonomy:

- `ClusteringRoutine` performs any of the clustering methods described above.
- `run_clustering_generators` runs all clustering methods and returns the results in a dictionary.
- `make_dataframe` creates a dataframe with the results of the clustering methods.
- `make_plots` creates a series of plots to visualize the results of the clustering methods.
- `make_cooccurrences` creates a co-occurrence matrix of the clustering results.
- `make_subplot_embeddings` creates a series of subplots with the embeddings of the entities in the taxonomy.

[TODO] Pipeline should simply create the necessary dataframes and save them to S3.
[TODO] Plots & validation / silhouettes should be in a subfolder of the pipeline, called "validation" or "evaluation".
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
from IPython.display import display
import boto3, pickle, json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import umap.umap_ as umap
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from hdbscan import HDBSCAN
from itertools import product
from toolz import pipe
from collections import defaultdict
from itertools import chain
from functools import partial
from dap_aria_mapping import PROJECT_DIR, BUCKET_NAME, logger
from dap_aria_mapping.utils.semantics import (
    make_subplot_embeddings,
    make_dataframe,
    make_plots,
    make_cooccurrences,
    run_clustering_generators,
)
np.random.seed(42)
The entity tags are obtained from OpenAlex. Filtering is applied to remove entities that are too frequent or too infrequent. The entities are then embedded using the SPECTER model. In addition, two-dimensional representations of the embeddings are obtained using UMAP, for plotting purposes.
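The frequency filter can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `min_freq` and `max_freq_share` are hypothetical thresholds, and the input shape (a list of tag lists, one per article) is assumed.

```python
from collections import Counter

def filter_entities(tagged_articles, min_freq=10, max_freq_share=0.1):
    """Drop entities that appear in too few articles (noise) or in too
    large a share of articles (uninformative). Thresholds are
    illustrative, not the values used in the actual pipeline."""
    n_articles = len(tagged_articles)
    counts = Counter(tag for tags in tagged_articles for tag in set(tags))
    keep = {
        tag
        for tag, c in counts.items()
        if c >= min_freq and c / n_articles <= max_freq_share
    }
    return [[tag for tag in tags if tag in keep] for tags in tagged_articles]
```

The surviving tags would then be embedded (here, with the SPECTER model) and reduced to two dimensions with UMAP for plotting.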
s3 = boto3.client("s3")
try:
    try:
        logger.info("Downloading embeddings from S3")
        embeddings_object = s3.get_object(
            Bucket=BUCKET_NAME,
            Key="outputs/embeddings/embeddings.pkl"
        )
        embeddings = pickle.loads(embeddings_object["Body"].read())
    except Exception:
        logger.info("Failed to download from S3. Attempting to load from local instead")
        with open(f"{PROJECT_DIR}/outputs/embeddings.pkl", "rb") as f:
            embeddings = pickle.load(f)
except Exception:
    logger.info("Failed to load embeddings. Running pipeline with default (test) parameters")
    import subprocess
    subprocess.run(
        f"python {PROJECT_DIR}/dap_aria_mapping/pipeline/embeddings/make_embeddings.py",
        shell=True
    )
    with open(f"{PROJECT_DIR}/outputs/embeddings.pkl", "rb") as f:
        embeddings = pickle.load(f)
embeddings = pd.DataFrame.from_dict(embeddings).T
embeddings = embeddings.iloc[:5_000]
# UMAP
params = [
    ["n_neighbors", [20]],
    ["min_dist", [0.1]],
    ["n_components", [8]],
]
keys, permuts = ([x[0] for x in params], list(product(*[x[1] for x in params])))
param_perms = [{k: v for k, v in zip(keys, perm)} for perm in permuts]

for perm in param_perms:
    embeddings_2d = umap.UMAP(**perm).fit_transform(embeddings)
    fig = plt.figure(figsize=(10, 10))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=1)
    fig.suptitle(f"{perm}")
    plt.show()
2023-01-12 12:00:00,628 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials
2023-01-12 12:00:00,757 - dap_aria_mapping - INFO - Downloading embeddings from S3
The following clustering routine iteratively clusters the entity embeddings using Agglomerative Clustering and KMeans. At each level, clustering is performed on the subsets that were created at the previous level.
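The nested routine can be sketched as follows. This is a simplified stand-in for what `ClusteringRoutine` does, not its actual implementation: each level is fit separately on every subset produced by the previous level, and labels are offset so they remain globally unique.

```python
import numpy as np
from sklearn.cluster import KMeans

def nested_cluster(X, level_params):
    """Cluster X level by level, fitting each level only within the
    subsets (parent clusters) produced at the previous level. Returns
    an (n_samples, n_levels) array of per-level labels."""
    labels = np.zeros((X.shape[0], len(level_params)), dtype=int)
    parents = np.zeros(X.shape[0], dtype=int)  # start from a single root cluster
    for lvl, params in enumerate(level_params):
        offset = 0
        children = np.empty_like(parents)
        for parent in np.unique(parents):
            mask = parents == parent
            n = min(params["n_clusters"], int(mask.sum()))  # guard tiny subsets
            sub = KMeans(n_clusters=n, n_init=5, random_state=0).fit_predict(X[mask])
            children[mask] = sub + offset
            offset += n
        labels[:, lvl] = children
        parents = children
    return labels
```

Because each level only ever splits its parent clusters, the resulting label columns are hierarchically consistent by construction.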
cluster_configs = [
    [
        KMeans,
        [
            {"n_clusters": 5, "n_init": 5},   # parent level
            {"n_clusters": 5, "n_init": 5},   # nested level 1
            {"n_clusters": 5, "n_init": 5},   # nested level 2
            {"n_clusters": 10, "n_init": 5},  # nested level 3
        ],
    ],
    [
        AgglomerativeClustering,
        [
            {"n_clusters": 5},   # parent level
            {"n_clusters": 5},   # nested level 1
            {"n_clusters": 5},   # nested level 2
            {"n_clusters": 10},  # nested level 3
        ],
    ],
]
# run clustering generators
cluster_outputs_s, plot_dicts = run_clustering_generators(cluster_configs, embeddings)
# plot results
fig, axis = plt.subplots(2, 4, figsize=(32, 16), dpi=200)
for idx, (cdict, cluster) in enumerate(plot_dicts):
    _, lvl = divmod(idx, 4)
    make_subplot_embeddings(
        embeddings=embeddings_2d,
        clabels=[int(e) for e in cdict.values()],
        axis=axis.flat[idx],
        label=f"{cluster[-1]} {str(lvl)}",
        s=4,
    )
fig.savefig(PROJECT_DIR / "outputs" / "figures" / "semantic_taxonomy" / "clustering_strict.png")
# print silhouettes
for output in cluster_outputs_s:
    print(
        "Silhouette score - {} clusters - {}: {}".format(
            output["model"][-1].__module__,
            output["model"][-1].get_params()["n_clusters"],
            output["silhouette"],
        )
    )
Silhouette score - sklearn.cluster._kmeans clusters - 5: [0.045005124]
Silhouette score - sklearn.cluster._kmeans clusters - 5: [0.045005124, 0.015171775]
Silhouette score - sklearn.cluster._kmeans clusters - 5: [0.045005124, 0.015171775, 0.0071048234]
Silhouette score - sklearn.cluster._kmeans clusters - 2: [0.045005124, 0.015171775, 0.0071048234, -0.0071440507]
Silhouette score - sklearn.cluster._agglomerative clusters - 5: [0.026320975]
Silhouette score - sklearn.cluster._agglomerative clusters - 5: [0.026320975, -0.0022284377]
Silhouette score - sklearn.cluster._agglomerative clusters - 5: [0.026320975, -0.0022284377, -0.00069950375]
Silhouette score - sklearn.cluster._agglomerative clusters - 10: [0.026320975, -0.0022284377, -0.00069950375, 0.019138228]
The following clustering routine iteratively clusters the entity embeddings using KMeans. At each level, clustering is performed on the subsets that were created at the previous level. The number of clusters at each level is allowed to vary, being determined by the size of the parent cluster.
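One plausible rule for the size-dependent cluster count is to scale each child run's budget by the parent cluster's share of the data. This is an assumption for illustration; the exact rule lives inside `run_clustering_generators`.

```python
def n_clusters_for(parent_size, total_size, level_budget):
    """Scale a level's cluster budget by the parent cluster's share of
    the data, keeping at least one cluster (illustrative rule only)."""
    return max(1, round(level_budget * parent_size / total_size))
```

Under this rule, a parent holding half the entities would receive half the level's cluster budget, while a tiny parent would still be assigned a single cluster rather than zero.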
cluster_configs = [
    [
        KMeans,
        [
            {"n_clusters": 5, "n_init": 5},   # parent level
            {"n_clusters": 20, "n_init": 5},  # nested level 1, total n_clusters is 20+
            {"n_clusters": 20, "n_init": 5},  # nested level 2, total n_clusters is 20+
            {"n_clusters": 40, "n_init": 5},  # nested level 3, total n_clusters is 40+
        ],
    ],
]
# run clustering generators with imbalanced nested clusters
cluster_outputs_simb, plot_dicts = run_clustering_generators(cluster_configs, embeddings, imbalanced=True)
# plot results
fig, axis = plt.subplots(1, 4, figsize=(32, 8), dpi=200)
for idx, (cdict, cluster) in enumerate(plot_dicts):
    labels = [int(e) for e in cdict.values()]
    di = dict(zip(sorted(set(labels)), range(len(set(labels)))))
    labels = [di[label] for label in labels]
    _, lvl = divmod(idx, 4)
    make_subplot_embeddings(
        embeddings=embeddings_2d,
        clabels=labels,
        axis=axis.flat[idx],
        label=f"{cluster[-1]} {str(lvl)}",
        s=4,
    )
fig.savefig(PROJECT_DIR / "outputs" / "figures" / "semantic_taxonomy" / "clustering_strict_imb.png")
# print silhouettes
for output in cluster_outputs_simb:
    print(
        "Silhouette score - {} clusters - {}: {}".format(
            output["model"][-1].__module__,
            output["model"][-1].get_params()["n_clusters"],
            output["silhouette"],
        )
    )
Silhouette score - sklearn.cluster._kmeans clusters - 5: [0.044753313]
Silhouette score - sklearn.cluster._kmeans clusters - 3: [0.044753313, -0.0075814454]
Silhouette score - sklearn.cluster._kmeans clusters - 2: [0.044753313, -0.0075814454, -0.002601674]
Silhouette score - sklearn.cluster._kmeans clusters - 2: [0.044753313, -0.0075814454, -0.002601674, -0.0036360992]
The following approach iteratively clusters the entity embeddings using any sklearn method that supports the predict_proba method. No notion of level exists in this approach: more fine-grained clusterings are agnostic about the parent cluster output. Including several lists of parameter values will produce outputs for the Cartesian product of all parameter values within a clustering method.
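The Cartesian-product expansion of parameter lists can be sketched with `itertools.product`, mirroring the grid built for UMAP earlier in the notebook (the helper name is illustrative):

```python
from itertools import product

def expand_params(config):
    """Expand {param: value-or-list-of-values} into a list of kwargs
    dicts covering the Cartesian product of all listed values."""
    keys = list(config)
    values = [v if isinstance(v, list) else [v] for v in config.values()]
    return [dict(zip(keys, combo)) for combo in product(*values)]
```

For the DBSCAN config below, `{"eps": [0.15, 0.25], "min_samples": [8, 16]}` expands into four runs, one per (eps, min_samples) pair; for KMeans, the repeated `n_clusters` values each yield a separate run.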
cluster_configs = [
    [KMeans, [{"n_clusters": [5, 5, 5, 10], "n_init": 5}]],             # four runs, one per n_clusters value
    [AgglomerativeClustering, [{"n_clusters": [5, 5, 5, 10]}]],         # four runs, one per n_clusters value
    [DBSCAN, [{"eps": [0.15, 0.25], "min_samples": [8, 16]}]],          # four runs: eps × min_samples
    [HDBSCAN, [{"min_cluster_size": [4, 8], "min_samples": [8, 16]}]],  # four runs: min_cluster_size × min_samples
]
# run clustering generators with fuzzy clusters
cluster_outputs_f_, plot_dicts = run_clustering_generators(cluster_configs, embeddings)
# plot results
fig, axis = plt.subplots(4, 4, figsize=(40, 40), dpi=200)
for idx, (cdict, cluster) in enumerate(plot_dicts):
    make_subplot_embeddings(
        embeddings=embeddings_2d,
        clabels=[int(e) for e in cdict.values()],
        axis=axis.flat[idx],
        label=f"{cluster[-1].__module__}",
        cmap="gist_ncar",
    )
fig.savefig(PROJECT_DIR / "outputs" / "figures" / "semantic_taxonomy" / "clustering_fuzzy.png")
# print silhouettes
for cluster in cluster_outputs_f_:
    print(
        "Silhouette score - {}: {}".format(
            cluster["model"][-1], cluster["silhouette"]
        )
    )
Silhouette score - KMeans(n_clusters=5, n_init=5): [0.043042015]
Silhouette score - KMeans(n_clusters=5, n_init=5): [0.043309145]
Silhouette score - KMeans(n_clusters=5, n_init=5): [0.043810967]
Silhouette score - KMeans(n_clusters=10, n_init=5): [0.016806971]
Silhouette score - AgglomerativeClustering(n_clusters=5): [0.026320975]
Silhouette score - AgglomerativeClustering(n_clusters=5): [0.026320975]
Silhouette score - AgglomerativeClustering(n_clusters=5): [0.026320975]
Silhouette score - AgglomerativeClustering(n_clusters=10): [0.003051781]
Silhouette score - DBSCAN(eps=0.15, min_samples=8): [0]
Silhouette score - DBSCAN(eps=0.15, min_samples=16): [0]
Silhouette score - DBSCAN(eps=0.25, min_samples=8): [0]
Silhouette score - DBSCAN(eps=0.25, min_samples=16): [0]
Silhouette score - HDBSCAN(min_cluster_size=4, min_samples=8): [-0.0589354]
Silhouette score - HDBSCAN(min_cluster_size=4, min_samples=16): [-0.038069718]
Silhouette score - HDBSCAN(min_cluster_size=8, min_samples=8): [-0.021693144]
Silhouette score - HDBSCAN(min_cluster_size=8, min_samples=16): [-0.17777418]
This approach uses a single run of any sklearn clustering method that exposes a children_ attribute. The children_ attribute is used to recreate the dendrogram that produced the clustering, which is then used to create the taxonomy. The climbing algorithm advances one union of subtrees at a time. The number of levels is determined by the dendrogram_levels parameter.
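In sklearn, `children_[i]` records the two nodes merged at step `i` into a new node numbered `n_samples + i`. Flat labels at any point in the dendrogram can be recovered by walking these merges, as in the sketch below (a simplified version of the climbing step, not the actual routine):

```python
import numpy as np

def labels_at_merge(children, n_samples, n_merges):
    """Recover flat cluster labels after applying the first `n_merges`
    merges from a `children_`-style array, where merge i combines two
    nodes into a new node numbered n_samples + i."""
    parent = {}
    for i, (a, b) in enumerate(children[:n_merges]):
        parent[a] = parent[b] = n_samples + i

    def root(node):
        # Follow parent pointers up to the current top of the subtree.
        while node in parent:
            node = parent[node]
        return node

    roots = {root(i) for i in range(n_samples)}
    index = {r: k for k, r in enumerate(sorted(roots))}
    return np.array([index[root(i)] for i in range(n_samples)])
```

Calling this at several cut points (numbers of merges) yields the per-level labels that the dendrogram-climbing taxonomy is built from.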
cluster_configs = [[AgglomerativeClustering, [{"n_clusters": 100}]]]
# run clustering generators with dendrograms
cluster_outputs_d, plot_dicts = run_clustering_generators(cluster_configs, embeddings, dendrogram_levels=6)
# plot results
fig, axis = plt.subplots(2, 3, figsize=(24, 16), dpi=200)
for i, ax in zip(range(6), axis.flat):
    make_subplot_embeddings(
        embeddings=embeddings_2d,
        clabels=[int(e[i]) for e in cluster_outputs_d["labels"].values()],
        axis=ax,
        label=f"dendrogram - level {i}",
        s=4,
    )
fig.savefig(PROJECT_DIR / "outputs" / "figures" / "semantic_taxonomy" / "clustering_dendrogram.png")
This approach uses any number of nested KMeans clustering runs. After a given level, the centroids of the previous level are used as the new data points for the next level.
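The centroid-chaining idea can be sketched as follows; this is a minimal stand-in for what the `centroids` flag triggers, not the actual implementation:

```python
from sklearn.cluster import KMeans

def centroid_chain(X, sizes):
    """Run KMeans with decreasing cluster counts, feeding each level the
    centroids of the previous one. Returns the fitted per-level models."""
    models, data = [], X
    for n in sizes:  # e.g. [200, 50, 20, 5]
        km = KMeans(n_clusters=n, n_init=5, random_state=0).fit(data)
        models.append(km)
        data = km.cluster_centers_  # next level clusters the centroids
    return models
```

An entity's label at a coarser level is then obtained by chaining assignments: its level-0 cluster index is itself a data point for level 1, and so on up the chain.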
cluster_configs = [
    [
        KMeans,
        [
            {"n_clusters": 200, "n_init": 5, "centroids": False},
            {"n_clusters": 50, "n_init": 5, "centroids": True},
            {"n_clusters": 20, "n_init": 5, "centroids": True},
            {"n_clusters": 5, "n_init": 5, "centroids": True},
        ],
    ],
]
# run clustering generators with centroids
cluster_outputs_c, plot_dicts = run_clustering_generators(
cluster_configs, embeddings, embeddings_2d=embeddings_2d
)
# [HACK] flip order, should be fixed in run_clustering_generators (should run highest level → lowest level)
for output_dict in cluster_outputs_c:
    for k, v in output_dict["labels"].items():
        output_dict["labels"][k] = v[::-1]
    output_dict["silhouette"] = output_dict["silhouette"][::-1]
# plot results
fig, axis = plt.subplots(1, 4, figsize=(32, 8), dpi=200)
for idx, cdict in enumerate(cluster_outputs_c):
    if not cdict.get("centroid_params", False):
        axis[idx].scatter(
            embeddings_2d[:, 0],
            embeddings_2d[:, 1],
            c=[e for e in cdict["labels"].values()],
            s=1,
        )
    else:
        axis[idx].scatter(
            cdict["centroid_params"]["n_embeddings_2d"][:, 0],
            cdict["centroid_params"]["n_embeddings_2d"][:, 1],
            c=cdict["model"][idx].labels_,
            s=cdict["centroid_params"]["sizes"],
        )
    print(f"Silhouette score ({idx}): {cdict['silhouette']}")
fig.savefig(PROJECT_DIR / "outputs" / "figures" / "semantic_taxonomy" / "clustering_centroids.png")
Silhouette score (0): [0.020928387]
Silhouette score (1): [-0.009255368, 0.020928387]
Silhouette score (2): [0.04395641, -0.009255368, 0.020928387]
Silhouette score (3): [0.10650885, 0.04395641, -0.009255368, 0.020928387]
This section outputs silhouette scores for all relevant outputs above. It also constructs barplots of the cluster sizes for each level of the taxonomy across approaches.
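Per-level silhouette scores of the kind reported above can be computed with sklearn's `silhouette_score`; a minimal sketch, assuming a label matrix with one column per level (the helper name is illustrative):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def level_silhouettes(X, labels):
    """Silhouette score for each level of a hierarchical label matrix
    (columns = levels). Levels with a single cluster are skipped, since
    the silhouette is undefined there."""
    return [
        float(silhouette_score(X, labels[:, lvl]))
        for lvl in range(labels.shape[1])
        if len(np.unique(labels[:, lvl])) > 1
    ]
```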
# Harmonize cluster outputs for analysis
# [HACK] - fix this. For exports, I create a single dictionary for the fuzzy clusters
cluster_outputs_f = []
for group in ["sklearn.cluster._kmeans", "sklearn.cluster._agglomerative"]:
    dict_group = {
        "labels": defaultdict(list),
        "model": [],
        "silhouette": [],
        "centroid_params": None,
    }
    cluster_output = [x for x in cluster_outputs_f_ if x["model"][0].__module__ == group]
    for clust in cluster_output:
        for k, v in clust["labels"].items():
            dict_group["labels"][k].append(v[0])
        dict_group["model"].append(
            "_".join(
                [
                    clust["model"][0].__module__.replace(".", ""),
                    str(clust["model"][0].get_params()["n_clusters"]),
                ]
            )
        )
        dict_group["silhouette"].append(clust["silhouette"][0])
    cluster_outputs_f.append(dict_group)
strict_kmeans_df = make_dataframe(cluster_outputs_s[3], "_strict")
strict_agglom_df = make_dataframe(cluster_outputs_s[7], "_strict")
strict_kmeans_imb_df = make_dataframe(cluster_outputs_simb[-1], "_strict_imbalanced")
fuzzy_kmeans_df = make_dataframe(cluster_outputs_f[0], "_fuzzy")
fuzzy_agglom_df = make_dataframe(cluster_outputs_f[1], "_fuzzy")
dendrogram_df = make_dataframe(cluster_outputs_d, "")
centroid_kmeans_df = make_dataframe(cluster_outputs_c[-1], "_centroids", cumulative=False)
make_plots(strict_kmeans_df)
make_plots(strict_agglom_df)
make_plots(strict_kmeans_imb_df)
make_plots(fuzzy_kmeans_df)
make_plots(dendrogram_df)
make_plots(centroid_kmeans_df)
results = {
    "kmeans_strict": cluster_outputs_s[3]["silhouette"],
    "agglom_strict": cluster_outputs_s[7]["silhouette"],
    "kmeans_strict_imb": cluster_outputs_simb[-1]["silhouette"],
    "kmeans_fuzzy": cluster_outputs_f[0]["silhouette"],
    "agglom_fuzzy": cluster_outputs_f[1]["silhouette"],
    "agglomerative_dendrogram": cluster_outputs_d["silhouette"],
    "kmeans_centroid": cluster_outputs_c[-1]["silhouette"],
}
results = {"_".join([k, str(id)]): e for k, v in results.items() for id, e in enumerate(v)}
silhouette_df = pd.DataFrame(results, index=["silhouette"]).T.sort_values(
    "silhouette", ascending=False
)
display(silhouette_df)
| | silhouette |
|---|---|
| kmeans_centroid_0 | 0.106509 |
| agglomerative_dendrogram_0 | 0.058294 |
| agglomerative_dendrogram_1 | 0.052698 |
| kmeans_strict_0 | 0.045005 |
| kmeans_strict_imb_0 | 0.044753 |
| kmeans_centroid_1 | 0.043956 |
| kmeans_fuzzy_2 | 0.043811 |
| kmeans_fuzzy_1 | 0.043309 |
| kmeans_fuzzy_0 | 0.043042 |
| agglom_strict_0 | 0.026321 |
| agglomerative_dendrogram_2 | 0.026321 |
| agglom_fuzzy_2 | 0.026321 |
| agglom_fuzzy_1 | 0.026321 |
| agglom_fuzzy_0 | 0.026321 |
| kmeans_centroid_3 | 0.020928 |
| agglom_strict_3 | 0.019138 |
| kmeans_fuzzy_3 | 0.016807 |
| kmeans_strict_1 | 0.015172 |
| kmeans_strict_2 | 0.007105 |
| agglom_fuzzy_3 | 0.003052 |
| agglom_strict_2 | -0.000700 |
| agglom_strict_1 | -0.002228 |
| kmeans_strict_imb_2 | -0.002602 |
| kmeans_strict_imb_3 | -0.003636 |
| agglomerative_dendrogram_3 | -0.006234 |
| kmeans_strict_3 | -0.007144 |
| kmeans_strict_imb_1 | -0.007581 |
| kmeans_centroid_2 | -0.009255 |
| agglomerative_dendrogram_4 | -0.022471 |
| agglomerative_dendrogram_5 | -0.024204 |
Following the approach of Juan in the AFS repository, we combine the clustering methods to produce a matrix of entity co-occurrences. The objective is to apply community detection algorithms on this matrix.
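The co-occurrence count is simple: for each pair of entities, count the number of clustering columns that place them in the same cluster. A minimal sketch of what `make_cooccurrences` computes (the function below is illustrative, not the actual utility):

```python
import numpy as np

def cooccurrence_counts(label_matrix):
    """For an (n_entities, n_clusterings) label matrix, return an
    (n_entities, n_entities) matrix counting, for each pair, how many
    clustering columns assign both entities to the same cluster."""
    n, k = label_matrix.shape
    out = np.zeros((n, n), dtype=int)
    for j in range(k):
        col = label_matrix[:, j]
        # Pairwise equality within this clustering column.
        out += (col[:, None] == col[None, :]).astype(int)
    return out
```

The diagonal equals the number of clustering columns (every entity always co-occurs with itself), which is why the diagonal of the table below is constant.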
list_dfs = [
    strict_kmeans_df,
    strict_agglom_df,
    strict_kmeans_imb_df,
    fuzzy_kmeans_df,
    fuzzy_agglom_df,
    dendrogram_df,
    centroid_kmeans_df,
]
meta_cluster_df = (
    pd.concat(list_dfs, axis=1)
    .reset_index()
    .rename(columns={"index": "tag"})
)
cooccur_dict = make_cooccurrences(meta_cluster_df)
cooccur_df = pd.DataFrame(cooccur_dict, index=meta_cluster_df["tag"], columns=meta_cluster_df["tag"])
cooccur_df.head(10)
| tag | Baffin Bay | Remote Sensing Systems | Avalanche breakdown | Product-service system | Solow–Swan model | Common Interface | VRK1 | Li Ka-shing | Trimethylene carbonate | Electrode potential | ... | Decarboxylation | Venule | SKAP1 | Egon Brunswik | Fast multipole method | Sloan letters | Great Famine (Ireland) | Hampstead | WORC (AM) | Cataglyphis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tag | |||||||||||||||||||||
| Baffin Bay | 30 | 1 | 7 | 14 | 14 | 18 | 17 | 24 | 1 | 1 | ... | 1 | 14 | 17 | 16 | 14 | 21 | 9 | 19 | 15 | 20 |
| Remote Sensing Systems | 1 | 30 | 3 | 1 | 2 | 1 | 1 | 1 | 3 | 4 | ... | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Avalanche breakdown | 7 | 3 | 30 | 9 | 9 | 7 | 7 | 7 | 10 | 10 | ... | 8 | 2 | 7 | 2 | 10 | 7 | 3 | 7 | 8 | 7 |
| Product-service system | 14 | 1 | 9 | 30 | 15 | 14 | 14 | 14 | 1 | 1 | ... | 1 | 9 | 14 | 9 | 16 | 14 | 13 | 14 | 22 | 14 |
| Solow–Swan model | 14 | 2 | 9 | 15 | 30 | 14 | 15 | 14 | 1 | 2 | ... | 2 | 8 | 15 | 8 | 22 | 14 | 10 | 17 | 16 | 17 |
| Common Interface | 18 | 1 | 7 | 14 | 14 | 30 | 17 | 18 | 1 | 1 | ... | 1 | 14 | 17 | 13 | 15 | 20 | 9 | 22 | 14 | 21 |
| VRK1 | 17 | 1 | 7 | 14 | 15 | 17 | 30 | 17 | 1 | 1 | ... | 1 | 11 | 23 | 11 | 14 | 17 | 9 | 18 | 15 | 18 |
| Li Ka-shing | 24 | 1 | 7 | 14 | 14 | 18 | 17 | 30 | 1 | 1 | ... | 1 | 14 | 17 | 15 | 14 | 20 | 9 | 19 | 15 | 20 |
| Trimethylene carbonate | 1 | 3 | 10 | 1 | 1 | 1 | 1 | 1 | 30 | 25 | ... | 15 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Electrode potential | 1 | 4 | 10 | 1 | 2 | 1 | 1 | 1 | 25 | 30 | ... | 17 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
10 rows × 5000 columns